stochastic learning
SEBOOST - Boosting Stochastic Learning Using Subspace Optimization Techniques
SEBOOST applies a secondary optimization process in the subspace spanned by the last steps and descent directions. The method was inspired by the SESOP optimization method for large-scale problems, and has been adapted for the stochastic learning framework. It can be applied on top of any existing optimization method with no need to tweak the internal algorithm. We show that the method is able to boost the performance of different algorithms, and make them more robust to changes in their hyper-parameters. As the boosting steps of SEBOOST are applied between large sets of descent steps, the additional subspace optimization hardly increases the overall computational burden. We introduce two hyper-parameters that control the balance between the baseline method and the secondary optimization process. The method was evaluated on several deep learning tasks, demonstrating promising results.
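The alternation described above (blocks of baseline descent steps, then a secondary optimization over a few stored directions) can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a diagonal quadratic objective, plain noisy SGD as the baseline, and a subspace built only from the net displacement of each recent SGD block. The subspace problem is solved here by repeated exact line searches, not by the conjugate gradient method the review mentions. The names `inner` and `history` are hypothetical stand-ins for two balancing hyper-parameters and may not match the paper's definitions.

```python
import random

def loss(x, a):
    # Toy objective (assumption): f(x) = 0.5 * sum_i a_i * x_i^2
    return 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))

def grad(x, a):
    return [ai * xi for ai, xi in zip(a, x)]

def sgd_steps(x, a, lr, n_steps, noise, rng):
    # Baseline method: SGD with Gaussian noise added to the gradient.
    for _ in range(n_steps):
        g = [gi + rng.gauss(0.0, noise) for gi in grad(x, a)]
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x

def line_search(x, d, a):
    # Exact minimizer of loss(x + t * d) over t for the diagonal quadratic.
    curv = sum(ai * di * di for ai, di in zip(a, d))
    if curv <= 1e-12:
        return x
    t = -sum(gi * di for gi, di in zip(grad(x, a), d)) / curv
    return [xi + t * di for xi, di in zip(x, d)]

def subspace_step(x, directions, a, sweeps=5):
    # Crude subspace optimizer: cycle exact line searches over the directions.
    for _ in range(sweeps):
        for d in directions:
            x = line_search(x, d, a)
    return x

def seboost_sketch(x0, a, outer=10, inner=20, history=3,
                   lr=0.05, noise=0.1, seed=0):
    rng = random.Random(seed)
    x, directions, trace = list(x0), [], []
    for _ in range(outer):
        start = x
        x = sgd_steps(x, a, lr, inner, noise, rng)
        # Store the net displacement of this SGD block as a subspace direction.
        directions.append([xe - xs for xe, xs in zip(x, start)])
        directions = directions[-history:]
        before = loss(x, a)
        x = subspace_step(x, directions, a)
        trace.append((before, loss(x, a)))
    return x, trace

if __name__ == "__main__":
    x, trace = seboost_sketch([1.0, -2.0, 3.0], [1.0, 5.0, 10.0])
    for before, after in trace:
        print(f"loss before boost {before:.6f}  after boost {after:.6f}")
```

Because each line search is exact for this quadratic, a boosting step never increases the loss here. With a real network, the subspace problem would be evaluated on minibatches and no such guarantee holds.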
S2MoE: Robust Sparse Mixture of Experts via Stochastic Learning
Do, Giang, Le, Hung, Tran, Truyen
Sparse Mixture of Experts (SMoE) enables efficient training of large language models by routing input tokens to a select number of experts. However, training SMoE remains challenging due to the issue of representation collapse. Recent studies have focused on improving the router to mitigate this problem, but existing approaches face two key limitations: (1) expert embeddings are significantly smaller than the model's dimension, contributing to representation collapse, and (2) routing each input to the Top-K experts can cause them to learn overly similar features. In this work, we propose a novel approach called Robust Sparse Mixture of Experts via Stochastic Learning (S2MoE), which is a mixture of experts designed to learn from both deterministic and non-deterministic inputs via Learning under Uncertainty. Extensive experiments across various tasks demonstrate that S2MoE achieves performance comparable to other routing methods while reducing computational inference costs by 28%.
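For readers unfamiliar with the routing step the abstract refers to, the following is a minimal sketch of conventional Top-K routing, not of S2MoE itself. It assumes the router scores each expert by the dot product between the token and a per-expert embedding, then renormalizes with a softmax over the selected experts only. The function name `top_k_route` and the toy vectors are hypothetical.

```python
import math

def top_k_route(token, expert_embeddings, k):
    # Router score for each expert: dot product of token and expert embedding.
    scores = [sum(t * e for t, e in zip(token, emb)) for emb in expert_embeddings]
    # Keep the k highest-scoring experts.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax restricted to the selected experts (max-shifted for stability).
    m = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - m) for i in top]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(top, exps)]

if __name__ == "__main__":
    experts = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    for index, weight in top_k_route([1.0, 0.0], experts, k=2):
        print(f"expert {index}: gate weight {weight:.3f}")
```

Only the selected experts are evaluated, which is where the efficiency of sparse routing comes from.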
Reviews: SEBOOST - Boosting Stochastic Learning Using Subspace Optimization Techniques
The paper is overall clearly written, but one important aspect of the algorithm is not sufficiently explained: how precisely the subspace optimization is carried out. The paper only mentions in passing that it uses conjugate gradient (CG), but a number of points deserve further clarification: a) Is CG done over a *single* larger minibatch? And how precisely is this minibatch chosen? Which version/implementation do you use? The computational cost *and* additional memory requirement (as this can constitute a practical limitation for large nets) for the subspace optimization would need to be disclosed and made precise.
Estimating the Hessian Matrix of Ranking Objectives for Stochastic Learning to Rank with Gradient Boosted Trees
Kang, Jingwei, de Rijke, Maarten, Oosterhuis, Harrie
Stochastic learning to rank (LTR) is a recent branch in the LTR field that concerns the optimization of probabilistic ranking models. Their probabilistic behavior enables certain ranking qualities that are impossible with deterministic models. For example, they can increase the diversity of displayed documents, increase fairness of exposure over documents, and better balance exploitation and exploration through randomization. A core difficulty in LTR is gradient estimation; for this reason, existing stochastic LTR methods have been limited to differentiable ranking models (e.g., neural networks). This is in stark contrast with the general field of LTR, where Gradient Boosted Decision Trees (GBDTs) have long been considered the state-of-the-art. In this work, we address this gap by introducing the first stochastic LTR method for GBDTs. Our main contribution is a novel estimator for the second-order derivatives, i.e., the Hessian matrix, which is a requirement for effective GBDTs. To efficiently compute both the first and second-order derivatives simultaneously, we incorporate our estimator into the existing PL-Rank framework, which was originally designed for first-order derivatives only. Our experimental results indicate that stochastic LTR without the Hessian has extremely poor performance, whilst the performance is competitive with the current state-of-the-art with our estimated Hessian. Thus, through the contribution of our novel Hessian estimation method, we have successfully introduced GBDTs to stochastic LTR.
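To make "probabilistic ranking model" concrete, here is a small sketch of sampling a ranking from a Plackett-Luce model. It assumes that the "PL" in PL-Rank refers to Plackett-Luce, and it shows only the sampling step, not the paper's gradient or Hessian estimators. The function name `sample_pl_ranking` and the scores are hypothetical.

```python
import math
import random

def sample_pl_ranking(scores, rng):
    # Plackett-Luce: fill positions one at a time, picking each remaining
    # document with probability proportional to exp(score).
    remaining = list(range(len(scores)))
    ranking = []
    while remaining:
        m = max(scores[i] for i in remaining)  # shift for numerical stability
        weights = [math.exp(scores[i] - m) for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        ranking.append(pick)
        remaining.remove(pick)
    return ranking

if __name__ == "__main__":
    rng = random.Random(0)
    scores = [2.0, 0.5, -1.0]
    for _ in range(5):
        print(sample_pl_ranking(scores, rng))
```

Repeated sampling yields different orderings of the same documents, which is the randomization that the diversity, exposure and exploration arguments above rely on.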
Weight Space Probability Densities in Stochastic Learning: I. Dynamics and Equilibria
The ensemble dynamics of stochastic learning algorithms can be studied using theoretical techniques from statistical physics. We develop the equations of motion for the weight space probability densities for stochastic learning algorithms. We discuss equilibria in the diffusion approximation and provide expressions for special cases of the LMS algorithm. The equilibrium densities are not in general thermal (Gibbs) distributions in the objective function being minimized, but rather depend upon an effective potential that includes diffusion effects. Finally we present an exact analytical expression for the time evolution of the density for a learning algorithm with weight updates proportional to the sign of the gradient.
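The idea of a weight space density can be approximated empirically by running many independent learners and examining the spread of their weights. The sketch below does this for a one-dimensional LMS learner. It is an illustration under assumed settings (Gaussian inputs, additive Gaussian target noise, a fixed step size `mu`), not the paper's analytical treatment, and all names and parameter values are hypothetical.

```python
import random

def lms_ensemble(n_learners=300, n_steps=400, mu=0.05,
                 w_true=1.5, noise=0.1, seed=0):
    rng = random.Random(seed)
    weights = [0.0] * n_learners  # every learner starts at w = 0
    for _ in range(n_steps):
        for k in range(n_learners):
            x = rng.gauss(0.0, 1.0)                      # input sample
            target = w_true * x + rng.gauss(0.0, noise)  # noisy target
            err = target - weights[k] * x
            weights[k] += mu * x * err                   # LMS update
    mean = sum(weights) / n_learners
    var = sum((w - mean) ** 2 for w in weights) / n_learners
    return mean, var

if __name__ == "__main__":
    mean, var = lms_ensemble()
    print(f"ensemble mean {mean:.4f}, ensemble variance {var:.6f}")
```

The ensemble mean settles near `w_true`, while the variance stays nonzero: with a constant step size the equilibrium density has finite width around the optimum.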
Using Curvature Information for Fast Stochastic Search
We present an algorithm for fast stochastic gradient descent that uses a nonlinear adaptive momentum scheme to optimize the late time convergence rate. The algorithm makes effective use of curvature information, requires only O(n) storage and computation, and delivers convergence rates close to the theoretical optimum. We demonstrate the technique on linear and large nonlinear backprop networks. Learning algorithms that perform gradient descent on a cost function can be formulated in either stochastic (on-line) or batch form. Stochastic learning provides several advantages over batch learning.
Binary stochasticity enabled highly efficient neuromorphic deep learning achieves better-than-software accuracy
Li, Yang, Wang, Wei, Wang, Ming, Dou, Chunmeng, Ma, Zhengyu, Zhou, Huihui, Zhang, Peng, Lepri, Nicola, Zhang, Xumeng, Luo, Qing, Xu, Xiaoxin, Yang, Guanhua, Zhang, Feng, Li, Ling, Ielmini, Daniele, Liu, Ming
Deep learning needs high-precision handling of forwarding signals, backpropagating errors, and updating weights. This is inherently required by the learning algorithm since the gradient descent learning rule relies on the chain product of partial derivatives. However, deep learning is challenging to implement in hardware systems that use noisy analog memristors as artificial synapses, and it is not biologically plausible. Memristor-based implementations generally result in an excessive cost of neuronal circuits and stringent demands for idealized synaptic devices. Here, we demonstrate that the requirement for high precision is not necessary and that more efficient deep learning can be achieved when this requirement is lifted. We propose a binary stochastic learning algorithm that modifies all elementary neural network operations, by introducing (i) stochastic binarization of both the forwarding signals and the activation function derivatives, (ii) signed binarization of the backpropagating errors, and (iii) step-wised weight updates. Through an extensive hybrid approach of software simulation and hardware experiments, we find that binary stochastic deep learning systems can provide better performance than the software-based benchmarks using the high-precision learning algorithm. Also, the binary stochastic algorithm strongly simplifies the neural network operations in hardware, resulting in an improvement of the energy efficiency for the multiply-and-accumulate operations by more than three orders of magnitude.
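The two binarization operations named in (i) and (ii) can be sketched as follows. The abstract does not give the exact rules, so this assumes the common forms: a value in [0, 1] becomes 1 with probability equal to that value (a Bernoulli draw, unbiased in expectation), and an error is reduced to its sign. Both function names are hypothetical.

```python
import random

def stochastic_binarize(p, rng):
    # Assumed rule: emit 1 with probability p, else 0, so the expected
    # output equals p even though each sample is a single bit.
    return 1 if rng.random() < p else 0

def sign_binarize(err):
    # Assumed rule: keep only the sign of the backpropagated error.
    return 1 if err > 0 else (-1 if err < 0 else 0)

if __name__ == "__main__":
    rng = random.Random(0)
    n = 10000
    mean = sum(stochastic_binarize(0.3, rng) for _ in range(n)) / n
    print(f"mean of binarized samples for p=0.3: {mean:.3f}")
    print([sign_binarize(e) for e in (-0.7, 0.0, 0.02)])
```

Averaged over many samples, the one-bit signal recovers the underlying value, which is one reason a stochastic scheme can tolerate much lower per-operation precision.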
Weight Space Probability Densities in Stochastic Learning: II. Transients and Basin Hopping Times
In stochastic learning, weights are random variables whose time evolution is governed by a Markov process. At each time-step, n, the weights can be described by a probability density function P(w, n). We summarize the theory of the time evolution of P, and give graphical examples of the time evolution that contrast the behavior of stochastic learning with true gradient descent (batch learning). Finally, we use the formalism to obtain predictions of the time required for noise-induced hopping between basins of different optima. We compare the theoretical predictions with simulations of large ensembles of networks for simple problems in supervised and unsupervised learning.
SEBOOST - Boosting Stochastic Learning Using Subspace Optimization Techniques
Richardson, Elad, Herskovitz, Rom, Ginsburg, Boris, Zibulevsky, Michael
SEBOOST applies a secondary optimization process in the subspace spanned by the last steps and descent directions. The method was inspired by the SESOP optimization method for large-scale problems, and has been adapted for the stochastic learning framework. It can be applied on top of any existing optimization method with no need to tweak the internal algorithm. We show that the method is able to boost the performance of different algorithms, and make them more robust to changes in their hyper-parameters. As the boosting steps of SEBOOST are applied between large sets of descent steps, the additional subspace optimization hardly increases the overall computational burden.
Online and Stochastic Learning with a Human Cognitive Bias
Oiwa, Hidekazu (The University of Tokyo) | Nakagawa, Hiroshi (The University of Tokyo)
Sequential learning for classification tasks is an effective tool in the machine learning community. In sequential learning settings, algorithms sometimes make incorrect predictions on data that were correctly classified in the past. This paper explicitly deals with such inconsistent prediction behavior. Our main contributions are 1) to experimentally show its effect for user utilities as a human cognitive bias, 2) to formalize a new framework by internalizing this bias into the optimization problem, 3) to develop new algorithms without memorization of the past prediction history, and 4) to show some theoretical guarantees of our derived algorithm for both online and stochastic learning settings. Our experimental results show the superiority of the derived algorithm for problems involving human cognition.